ABSTRACT

Many sensors/meters are deployed in commercial buildings
to monitor and optimize their performance. However, because
sensor metadata is inconsistent across buildings,
software-based solutions are tightly coupled to the sensor
metadata conventions (i.e., schemas and naming) of each
building. Running the same software across buildings
requires significant integration effort.

Metadata normalization is critical for scaling the deployment
process and allows us to decouple building-specific
conventions from the code written for building applications. It
also allows us to deal with missing metadata. One important
aspect of normalization is to differentiate sensors by
the type of phenomena being observed. In this paper, we
propose a general, simple, yet effective classification scheme
to differentiate sensors in buildings by type. We perform
ensemble learning on data collected from over 2000 sensor
streams in two buildings. Our approach achieves
more than 92% accuracy for classification within buildings
and more than 82% accuracy across buildings. We also
introduce a method for identifying potentially misclassified
streams. This is important because it allows us to identify
opportunities to obtain more input from experts - input
that could help improve classification accuracy when ground
truth is unavailable. We show that by adjusting a threshold
value we are able to identify at least 30% of the misclassified
instances.

Categories and Subject Descriptors 

C.3 [Special-Purpose And Application-Based Systems]:
Real-time and embedded systems

General Terms 

Performance, Experimentation, Verification 

Keywords 

Sensor Type, Random Forest, Classification 


1. INTRODUCTION 

Commercial buildings are sites of large sensor/meter deployments
used to monitor and optimize their performance.
With the recent interest in reducing building energy consumption
and increasing their efficiency, it is important to
consider ways to quickly bootstrap a set of building data
streams into an analytical pipeline, such as overall building
efficiency or comfort-assessment analytics and control. However,
because sensor metadata is inconsistent across buildings,
software-based solutions are tightly coupled to the sensor
metadata conventions (i.e., schemas and naming) of
each building. Running the same software across buildings
requires significant integration effort.

Current ‘point’ naming conventions and unsystematic recording
of metadata form a bottleneck in deployment scalability
for analytics jobs. A ‘point’ refers to a physical location
where a sensor is taking measurements. Each building
vendor uses their own naming scheme, and unique variants
of each scheme are implemented from building to building;
variations exist even across buildings that have contracted
the same vendor. In addition, expanded descriptive information
about a point is sometimes unavailable, so determining
its meaning is painfully slow or impossible. Because
these conventions are carried out by humans, they are
inconsistent within and across building data sets. This makes
the integration process laborious for building experts and a
non-starter for non-experts. The process is fundamentally
unscalable.

Consider a simple analysis program that has the ability
to identify anomalous readings from a specific kind of sensor.
To execute this job, the process organizes each sensor
by type and location, generates the distribution of readings
across them, and identifies broken sensors where some fraction
of their readings are above some threshold value on the
distribution. The identification step in the process is the
most challenging because of the problems described. Ideally,
the program would search for points the way you search for
web pages in a search engine - using semantically meaningful
terminology.

Point names contain a set of codes that are semantically
meaningful to the building manager of a specific building. For
example, the point BLDA1R435_ART is constructed as a
concatenation of such codes: the name of the building (first 4
characters), the air handling unit identifier (the fifth
character), the room number (R435), and the type ART (area
room temperature), which indicates that this measurement
is produced by a temperature sensor. In addition to
point names, there may be some descriptive metadata. The
description for this point (if it exists) could state that
this is a “temperature sensor in room 435”. However, since
point names do not follow the exact same structure within
and across buildings (and certainly do not follow the same
convention across vendors), no single approach could solve
the normalization problem. A suite of approaches is
necessary.
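As a small illustration of why name-based parsing alone does not scale, the example point name above can be decomposed with a regular expression. The pattern below is a hypothetical one written only for that single name (four-character building code, one-character air handling unit, room, underscore, type); it is not a convention taken from any vendor.

```python
import re

# Hypothetical pattern for the single example name in the text; other
# buildings and vendors use different layouts, so it does not generalize.
POINT_PATTERN = re.compile(
    r"^(?P<building>.{4})(?P<ahu>.)(?P<room>R\d+)_(?P<type>[A-Z]+)$"
)

def parse_point(name):
    """Return the codes embedded in a point name, or None if it does not fit."""
    match = POINT_PATTERN.match(name)
    return match.groupdict() if match else None

parsed = parse_point("BLDA1R435_ART")
unparsed = parse_point("AHU2.SAT")  # an invented name in a different layout
print(parsed)
print(unparsed)
```

A name that follows any other layout simply fails to match, which is the situation the data-driven classification in this paper is meant to cover.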

Metadata normalization is critical for scaling the software
deployment process. It allows us to decouple building-specific
conventions from the code written for building applications.
Normalization allows us to boost existing metadata, correct
incorrect metadata, or generate common metadata when it
is missing altogether. One component in the normalization
suite should differentiate sensor feeds by type. For
example, we should be able to differentiate sensors
measuring temperature from sensors measuring pressure. In
addition, we should be able to use what we learn from one
building and apply it to another. This is especially useful
in cases where similar stream types are labeled differently,
labeled incorrectly, or not labeled at all.

Normalization would allow us to quickly run jobs across
many sites by enabling wide searchability of points across
many buildings at once. In order to meaningfully deal with
disparate building streams in a scalable fashion, the streams
should be searchable across various properties, such as building
name, room location, and type. Moreover, we assert
that wide searchability is necessary for achieving scalability.
By providing a tool for searching across building streams,
we minimize the deployment time for applications, allowing
them to be used in all buildings, not just a single one.

One important aspect of the sensor data and metadata that we
can leverage is the actual pattern in the readings themselves.
Deep inspection of features in the data can yield
meaningful results about the type of data that it is and
can help us with the label normalization problem. This
paper examines this path using standard machine learning
approaches. We observe that statistical features over small
time windows can be used to identify the stream type. Moreover,
we show that the classification of stream type can be
achieved using an ensemble of classifiers, which is known to
outperform a single classifier.

We conduct a comprehensive study on the data collected 
from over 2000 sensors in two separate buildings on two 
campuses. Our main contributions are: 

• We propose a simple, general, yet effective feature extraction
scheme to achieve sensor type classification in
the context of commercial buildings.

• We formulate an approach to identifying potentially misclassified
sensor streams (in terms of the type classes)
when no ground truth labels are available.

• We evaluate our classification technique using data
from over 2000 sensor series of 6 types in two buildings
on two campuses. Our technique achieves
around 92% and 98% accuracy when classifying
within each building, and around 82% accuracy
when inferring type information across buildings.

• We also evaluate our solution to misclassification
identification. The results demonstrate that we are able
to identify at least 30% of the target population
by choosing an optimal threshold for the decision.

We believe this is an important study given the recent trends
in the penetration of the internet of things into our homes
and environments. Studies show that normalization is an
especially pernicious and ubiquitous problem in embedded
systems, with only 7% of data tagged and only 1%
analyzed [15]. Our technique can be used to unify that data
across many deployments and enable broad search and exploration
of new applications. For example, sensing device
names for the internet of things are likely to follow similar
conventions with very little context. We argue that unification
through boosting will be necessary in this broader
domain.

2. METHODOLOGY 

In this section, we describe the design and construction of
the feature vector we use to characterize sensor type. We
explain what it fundamentally captures and, hence, why it
works so well for building sensor data. Then, we discuss
the classification technique we apply and give a detailed
description of the training and testing process. Finally, we
articulate a solution for identifying potentially misclassified
streams when no type-label ground truth is available.

2.1 Feature Extraction 

Raw sensor time series¹ usually contain millions of readings,
which are too general to be useful for type classification.
We need to distill the information embedded in the reading
patterns. A signal in the time domain tracks the amplitude
of a sensor reading, and different types of sensors generally
occupy distinct amplitude bins, as demonstrated in Figure 1.
We can characterize the amplitude distribution of a signal
in the time domain by using the percentiles of the value
distribution. To identify outliers in the distribution, we pick
the 50th percentile value (also known as the median) as a
discriminator, which is more robust to skew from outliers than
the average.

Naturally, sensor reading value ranges may overlap. For example,
during a rainy season, the humidity in an office can
reach the range of 60-70 (percent), which is the same as typical
temperature sensor readings (Fahrenheit). If you do not
consider measurement units, the distributions for each of the
two types look similar. Simply relying on percentiles is not
sufficient for differentiating sensor types. Figure 2 demonstrates
this point. To capture their difference we need to
include the variance of the signal in our feature vector.
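The overlap problem can be reproduced with a toy example. The sketch below builds two synthetic traces that share the same median but fluctuate by different amounts; the amplitudes and periods are invented for illustration and are not measurements from either building.

```python
import math
import statistics

# Synthetic, illustrative traces (not data from the paper): one reading per
# minute for three hours, both centred on 65 in their own units.
minutes = range(180)
room_temp_f = [65.0 + 0.5 * math.sin(2 * math.pi * m / 180) for m in minutes]
humidity_pct = [65.0 + 5.0 * math.sin(2 * math.pi * m / 45) for m in minutes]

temp_median = statistics.median(room_temp_f)
humid_median = statistics.median(humidity_pct)
temp_var = statistics.pvariance(room_temp_f)    # population variance (assumed)
humid_var = statistics.pvariance(humidity_pct)

# The medians coincide, so a percentile alone cannot separate the traces,
# while the variances differ by two orders of magnitude.
print(f"median:   temp={temp_median:.2f} humidity={humid_median:.2f}")
print(f"variance: temp={temp_var:.3f} humidity={humid_var:.3f}")
```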

When we extract features from raw sensor readings, the
original trace can span hours, days, or weeks, and the trend
can vary significantly, even from hour to hour. Extracting
certain features, such as percentiles and variances, over
the entire sensor trace might miss short-term dynamics and thus
miss discriminating characteristics. In contrast, computing
features over short time windows can produce too much
noise, and too many feature variables typically degrade
classifier performance. To succinctly summarize the dynamics of
sensor traces, we apply feature extraction to every time window
of fixed length on the original trace and compute the
statistics of the accumulated features from windowed slices
as the final feature set.

Upon close inspection of the traces, we notice that the
short-term dynamics of the phenomena being measured can
be used to differentiate them. Therefore, the distribution
of short-term summary statistics can be used to discriminate
traces by type. We construct our feature vector as follows:
first, each sensor signal is segmented into N
non-overlapping 45-minute windows (we discuss the
choice of window length in Section 3.6). Second, within
each time window, we compute the median and variance of
the signal, producing a vector of medians and a vector of
variances after the window slides over the entire trace:

$$MED = \{median_1, median_2, \ldots, median_N\}$$

$$VAR = \{variance_1, variance_2, \ldots, variance_N\}$$

where N is the number of time windows. The vectors MED
and VAR reflect short-term changes, but not all the intermediate
values are useful for classification. Finally, we compute
a statistical summary of the two vectors. For each vector we
compute the minimum, maximum, median and variance, resulting
in a feature vector with eight variables:

$$F = \{\min(MED), \max(MED), \mathrm{median}(MED), \mathrm{var}(MED), \min(VAR), \max(VAR), \mathrm{median}(VAR), \mathrm{var}(VAR)\}$$

F is the feature vector for each sensor trace used in our
classification process.

¹In this paper, we use the terms “trace”, “readings” and “time
series” interchangeably.

Figure 1: Different types of sensors occupy different amplitude bins in the time domain with different short term dynamics.

Figure 2: An example of two different types of sensors occupying
the same amplitude bin. (a) Room Temperature (b) Humidity

2.2 Classification

In general, ensemble learning methods obtain better predictive
performance than any of the constituent learning
methods, as discussed in [19, 22], if the following assumptions
hold [8]: 1) the probability of a correct classification
by each individual classifier is greater than 0.5 and 2) the
errors in predictions of the base-level classifiers are independent.
Random forests [4] have been widely used and outperform
a single tree classifier. They are also faster [29] in
training and testing than traditional classifiers such
as SVM. The notion of randomized trees was first introduced
in [2] and further developed in [4].


Random forests construct a multitude of classification trees.
To classify an unlabeled object, we construct the feature
vector and “feed” the vector down each of the trees in the
forest. Each tree gives a classification and we use it as a
“vote” for that class. The forest chooses the class having
the most votes over all the trees in the forest. The process
proceeds as follows:

1. Sample N instances at random with replacement² from
the original data set. These samples will be the training
set for growing this particular tree.

2. Specify M feature variables at random out of the total
feature vector when growing each node of a tree. The
best split (measured by the information gain) on
these M is used to split the node. The value of M is
constant during the forest growing.

3. Each tree is grown to the largest extent possible without
pruning.

²An element may appear multiple times in the sample set.


The randomness of this ensemble learning method occurs in
the first two steps. We set N equal to the number of instances
in the original training set, M equal to the square root of the
number of original feature variables, and the number of
trees in the forest to 50. Usually these parameters are optimized
through cross-validation; we refer interested readers
to [4] for the derivation and proofs behind random forests.
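A minimal sketch of how the stated parameters could map onto scikit-learn's `RandomForestClassifier`. The mapping is an assumption: the text names the library but not the exact constructor arguments, and the training data here is random stand-in data, not the building traces.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in data: 200 traces, 8 features each, 6 type labels.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 8))
y_train = rng.integers(0, 6, size=200)

forest = RandomForestClassifier(
    n_estimators=50,       # 50 trees in the forest
    max_features="sqrt",   # M = sqrt(number of feature variables) per split
    bootstrap=True,        # each tree sees N instances drawn with replacement
    criterion="entropy",   # splits scored by information gain
    max_depth=None,        # trees grown fully, without pruning
    random_state=0,        # fixed seed, only so the sketch is repeatable
)
forest.fit(X_train, y_train)

# One row of class probabilities per instance, averaged over the trees.
class_probabilities = forest.predict_proba(X_train[:3])
print(class_probabilities.shape)
```

With the default `max_samples=None`, each bootstrap sample has as many instances as the training set, which corresponds to the choice of N described above.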

All the instances in our data set are labeled with a ground
truth class. To train a random forest, we split the original
set into two subsets, one for building a forest and one
for testing the accuracy of the classifier. After the forest is
built from the training set, we learn the posterior probabilities
of each class c at each leaf node l for each tree t. Suppose
that T is the set of all trees, C is the set of all classes and
L is the set of all leaves for a given tree $t \in T$. In the
training stage the posterior probabilities $P_{t,l}(Y(i) = c)$ for
each class $c \in C$ at each leaf $l \in L$ are learned for each
tree $t \in T$. These probabilities are calculated as the ratio of
the number of instances i of class c that reach l to the total
number of instances that reach l. Y(i) is the class label for
instance i. To classify an instance in the testing stage, the
feature vector of the instance is passed down each tree until
it reaches a leaf node, which gives a probability distribution.
All the posterior probabilities accumulated from each tree
are then averaged and the argmax is taken as the class of
the instance. Note that instead of letting each tree vote for
one class as described in the original paper [4], we combine
the results from the classifiers for an instance by averaging their
probabilistic predictions. This facilitates the technique we
use to help identify potentially misclassified instances, as
described in the following section.
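The difference between averaging probabilities and counting votes can be seen on a toy forest. In the sketch below the per-tree leaf posteriors are invented numbers for one instance and three classes, and the average is taken over every tree, which is a simplifying assumption.

```python
from collections import Counter

# Hypothetical leaf posteriors from three trees for one instance, over
# three classes. The numbers are invented to show how the two rules differ.
tree_posteriors = [
    [0.55, 0.45, 0.00],
    [0.55, 0.45, 0.00],
    [0.00, 1.00, 0.00],
]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda k: values[k])

def average_posteriors(posteriors):
    """Average the per-tree class probabilities (the rule used in the text)."""
    n_trees = len(posteriors)
    n_classes = len(posteriors[0])
    return [sum(p[c] for p in posteriors) / n_trees for c in range(n_classes)]

def majority_vote(posteriors):
    """Each tree votes for its own most probable class."""
    votes = Counter(argmax(p) for p in posteriors)
    return votes.most_common(1)[0][0]

averaged = average_posteriors(tree_posteriors)
by_average = argmax(averaged)
by_vote = majority_vote(tree_posteriors)
print(averaged, by_average, by_vote)
```

Here two trees lean weakly toward class 0 while one is certain of class 1, so voting picks class 0 and averaging picks class 1. The averaged vector also carries the confidence information that a bare vote count discards.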


2.3 Quantify Classification Uncertainty 

Being able to measure the confidence of prediction results
and to identify potential misclassifications is vital to a learning
process. It is trivial to identify misclassification when
ground truth is available, but in many real-world cases ground
truth is unavailable. Quantifying classification uncertainty
can help identify potentially misclassified instances and presents
an opportunity to solicit the user for feedback that we can
use to improve our results. To quantify the uncertainty of
classifications in our learning process, we use the posterior
probabilities learned in the random forest.


With the learned posterior probabilities for each class, at
each leaf in each tree, we can compute the average probabilities
for each class as follows:

$$P(Y(i) = c) = \frac{\sum_{t \in T} P_{t,l}(Y(i) = c)}{|T|}$$

where T is the collection of trees in the forest for which
$P_{t,l}(Y(i) = c) \neq 0$, and $|\cdot|$ denotes the cardinality of a set. Given these
averaged probabilities for each class, the forest produces a
vector of class probabilities for each new instance as:

$$Pr = \{P(Y(i) = c)\}, \quad c \in C$$


Suppose we have a probability vector $Pr_1 = \{0.9, 0, 0, 0, 0, 0.1\}$
for instance $i_1$ and another vector $Pr_2 = \{0.3, 0.25, 0.1, 0, 0.15, 0.2\}$
for instance $i_2$. Both $i_1$ and $i_2$ will be assigned to the same
class according to the class probability distribution, but the
assignment of $i_2$ is less confident than that of $i_1$ because
its predicted class probability has a less “concentrated”
distribution. To measure the degree of “uncertainty” in the
classification of one instance, we compute the entropy of its class
probability vector yielded by the forest. We rank the classification
results and filter out, for further manual inspection,
the instances whose entropy is above a threshold. Inspection
can help eliminate misclassifications.
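The two example vectors can be worked through directly. The sketch below assumes Shannon entropy with the natural logarithm (the text does not state the base), and the threshold of 1.0 is a hypothetical value chosen only to separate the two cases.

```python
import math

def entropy(probabilities):
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)

pr_1 = [0.9, 0, 0, 0, 0, 0.1]
pr_2 = [0.3, 0.25, 0.1, 0, 0.15, 0.2]

h_1 = entropy(pr_1)   # about 0.325 nats
h_2 = entropy(pr_2)   # about 1.544 nats

threshold = 1.0  # hypothetical value; in practice it is tuned empirically
flagged = [name for name, h in (("i_1", h_1), ("i_2", h_2)) if h > threshold]
print(round(h_1, 3), round(h_2, 3), flagged)
```

Both instances receive the same label, but only the second exceeds the threshold and would be queued for manual inspection.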

3. EVALUATION 

To demonstrate the effectiveness of our methodology, we
evaluate our classification technique in two different scenarios:
a) intra-building, where the training and testing data
for classification are taken from the same building, and b)
inter-building, where the training and testing instances are
from two distinct buildings. We also discuss how the number
of training instances and the segment window size affect
the performance of classification. Finally, we analyze the
results of our solution for identifying potential misclassifications.


3.1 Taxonomy 

Most of the sensing points in a building can be classified
into 6 general types, which we use in our work: CO2, humidity,
room temperature, setpoint, air flow volume, and other
temperature. Room temperature includes only sensors that measure
the air temperature of rooms (“room temp” in Table 1),
and other temperature (“other temp” in Table 1) incorporates
all other temperature measurements involved in an
air ventilation system (illustrated in Figure 3³), such as supply
air/return air/mixed air temperature and chilled or hot
supply water/return water temperature. For set points, we
assign only one general type, which includes all set points for
every actuator configured in the building.



Figure 3: A typical HVAC system consisting of water-based 
heating/cooling pipes and air circulation ducts. 

3.2 Experimental Setup 

We collected a week’s worth of data from two separate buildings
on two campuses. One is Rice Hall at the
University of Virginia, where the sense points report to a
database [25] at intervals between 10 seconds and
10 minutes. The other building is Sutardja Dai Hall
(SDH) at UC Berkeley, where the deployed sensors [12] [1]
transmit to an archiver periodically, at intervals between
5 seconds and 10 minutes. The number of sensors of
each type in each building is summarized in Table 1,
and the type ground truth for each sense point is manually
expanded based on the metadata in the database.

³Included with permission from the authors of [3].


Type          Rice    SDH
CO2             16     52
Humidity        48     52
Room temp      142    216
Setpoint       265    819
Air volume      12    158
Other temp     119     37
Sum            602   1334


Table 1: Number of Sensors by Type 

All of our learning and classification processes are implemented
with the scikit-learn library, which is an
open-source machine learning package implemented mostly
in Python that provides a rich set of APIs.

3.3 Baseline and Metrics 

As a baseline to compare our proposed approach against,
we adopt a simple feature extraction scheme for each trace,
F = {med, var}, where med and var are simply the median
and variance computed over the entire trace.
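To make the contrast between the baseline and the windowed feature vector of Section 2.1 concrete, both can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, population variance is assumed, windows are cut by timestamp because reporting intervals vary, and the trace is synthetic.

```python
import statistics

def baseline_features(readings):
    """Baseline: median and variance over the entire trace."""
    return [statistics.median(readings), statistics.pvariance(readings)]

def windowed_features(timestamps, readings, window_seconds=45 * 60):
    """Min, max, median and variance of the per-window medians and variances."""
    windows = {}
    start = timestamps[0]
    for t, value in zip(timestamps, readings):
        windows.setdefault(int((t - start) // window_seconds), []).append(value)
    med = [statistics.median(w) for w in windows.values()]
    var = [statistics.pvariance(w) for w in windows.values()]

    def summarize(vector):
        return [min(vector), max(vector),
                statistics.median(vector), statistics.pvariance(vector)]

    return summarize(med) + summarize(var)

# Synthetic trace: one reading per minute for 3 hours (four 45-minute windows),
# stepping up by one unit per window with a small alternating ripple.
timestamps = [60 * m for m in range(180)]
readings = [70.0 + (m // 45) + 0.1 * (m % 2) for m in range(180)]

baseline = baseline_features(readings)
rich = windowed_features(timestamps, readings)
print(baseline)
print(rich)
```

The baseline collapses the whole trace into two numbers, while the eight-variable vector keeps how the local median and local variance move from window to window.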

For classification, we measure the averaged cross-validation
accuracy in two different scenarios (intra- and inter-building).
In the intra-building case, the data from a single
building is split into training and testing sets, and the results
illustrate how accurately the type information can be
inferred using local within-building information. For the
inter-building case, the experiment performs training and testing
across buildings, i.e., it trains the classifier on the data from
building A and tests it on building B. This set of experiments
tests how well the classification boundaries learned
from one building apply to another.

For identifying potential misclassifications, we choose the
true-positive rate (TPR, also known as recall), false-positive
rate (FPR, also known as fall-out) and positive predictive
value (PPV, also known as precision) as metrics to evaluate
the performance of our entropy-based approach under
different choices of threshold. In particular, in our
misclassification identification context, a true positive (TP)
is an instance considered to be misclassified that really is
misclassified, while a false positive (FP) is an instance
considered to be misclassified that is in fact a correct
classification.

3.4 Classification Accuracy 

We run the two sets of experiments described above, i.e.,
the intra- and inter-building tests, to examine the effectiveness
of the feature design and measure how well the classifier
performs. The classification results are summarized in
Tables 2-5. In each table, each row is specific to a type and
each column is the percentage of the full data set that was
used for training. Each cell shows two values. The value
without parentheses is the average classification accuracy
for the richer feature vector. The value in parentheses is the
average classification accuracy for the baseline approach described in
Section 3.3. These are compared throughout the tables. In
Tables 2 and 3, the last column summarizes “leave-one-out”
cross-validation⁴ results for each approach.

3.4.1 Intra Building Performance 

From the last column in Tables 2 and 3 we see that type
classification in a single building achieves accuracy of ~92%
and ~98% on Rice Hall and SDH respectively, for leave-one-out
(LOO) cross validation. The accuracy for the baseline
is also shown in the tables (in parentheses). The only type
we have difficulty differentiating is “other temp”, which includes
temperature measurements for air and water in the
ventilation system. In particular, the return air temperature
measurements (as illustrated in Figure 3) are almost
identical to the ones measuring air temperature in rooms,
because what the return duct exhausts is mostly the air from
a room.

3.4.2 Inter Building Performance 

This set of experiments illustrates how accurately we can
learn the type information of one building based on the
knowledge from another building. The overall classification
accuracy achieved for the two buildings by training on the
entire data set (train on SDH for Rice and train on Rice for
SDH) is around 82%, as seen from the last columns in
Tables 4 and 5. In particular, we see that the accuracy for “other
temperature” in Rice is abnormal compared to the rest of
the results. The issue with classifying “other temperature”
in Rice is that there are many sensing points measuring the
temperature of supply and return cold/hot water utilized
in the ventilation system, which are absent from the Berkeley
building used as the training set. Therefore the features of these
traces cannot be learned from SDH, which causes problems in
classifying them.

3.5 Learning Bootstrapping 

We also experiment with different numbers of training
instances to examine how that affects the classification accuracy,
which gives some insight into how many instances are
needed to bootstrap the classification process in both the
intra- and inter-building cases. In Tables 2-5,
the last columns show how accurately we can classify
on average. To examine the impact of the number of
instances on classification
accuracy, we use different percentages of the original data
set as the training set, i.e., 5%, 10%, 20%, 33%, and 50%; the
results are presented in the first five columns of each table.
For each percentage of training instances used, we apply
stratified sampling⁵ on the original set and the remaining
instances are used as the testing set. We repeat the same
percentage 1/percentage times to reduce random errors and get
⁴In LOO cross validation, each training set takes all the
instances except one, with the test set being the sample held
out.

⁵The sampled set contains the same percentage of samples
of each class as the original complete set.








Type        5%             10%            20%            33%            50%            LOO
CO2         51.3 (60.7)    83.7 (94.0)    98.4 (98.5)    100.0 (100.0)  93.8 (100.0)   93.8 (100.0)
Humidity    59.6 (61.8)    66.8 (67.9)    80.8 (77.1)    82.3 (74.3)    87.5 (83.8)    83.3 (89.8)
Room temp   89.0 (88.5)    93.0 (94.0)    95.6 (94.7)    93.3 (95.9)    97.2 (93.7)    95.1 (95.6)
Setpoint    97.0 (93.1)    97.5 (95.2)    99.2 (95.7)    99.2 (96.5)    98.5 (97.4)    99.2 (97.8)
Air volume  22.2 (21.8)    35.5 (30.5)    46.7 (49.6)    79.2 (66.7)    41.7 (83.3)    83.3 (75.0)
Other temp  54.7 (47.2)    64.8 (59.0)    70.0 (71.0)    72.7 (75.4)    74.7 (71.7)    74.8 (83.1)
Overall     80.4 (78.3)    85.9 (84.4)    90.0 (88.3)    90.9 (90.0)    91.4 (90.2)    91.7 (93.3)


Table 2: Intra-building Classification Accuracy for Rice Hall 


Type        5%             10%            20%            33%            50%            LOO
CO2         80.4 (63.4)    87.8 (75.8)    91.4 (74.6)    89.5 (83.7)    92.3 (86.5)    96.2 (76.9)
Humidity    91.6 (92.4)    94.4 (92.1)    97.6 (95.2)    98.1 (98.1)    100.0 (100.0)  98.1 (98.1)
Room temp   98.3 (96.0)    98.9 (96.3)    99.2 (95.6)    98.4 (94.7)    97.7 (93.1)    99.1 (95.8)
Setpoint    99.2 (89.0)    99.6 (90.4)    99.5 (91.4)    99.7 (91.8)    99.5 (90.7)    99.5 (93.3)
Air volume  78.4 (41.8)    87.1 (47.4)    92.7 (52.9)    96.8 (57.1)    98.7 (55.3)    97.5 (57.1)
Other temp  23.7 (19.4)    38.4 (28.7)    62.3 (36.5)    68.9 (48.7)    75.7 (59.9)    73.0 (59.5)
Overall     93.4 (81.4)    95.6 (83.7)    97.2 (85.2)    97.8 (86.6)    98.2 (86.0)    98.3 (87.7)


Table 3: Intra-building Classification Accuracy for SDH 


Type        5%             10%            20%            33%            50%            100%
CO2         29.7 (44.4)    45.6 (56.9)    75.0 (75.0)    93.8 (72.9)    93.8 (75.0)    87.5 (93.8)
Humidity    50.9 (30.5)    72.1 (26.7)    76.2 (28.6)    76.4 (21.1)    89.6 (16.3)    87.5 (26.5)
Room temp   97.6 (92.7)    99.4 (91.4)    100.0 (90.9)   97.2 (92.5)    100.0 (93.1)   100.0 (91.8)
Setpoint    97.8 (94.7)    98.2 (94.6)    98.0 (92.8)    97.7 (91.4)    97.5 (92.3)    98.9 (92.6)
Air volume  57.5 (21.2)    58.3 (18.3)    66.7 (23.3)    63.9 (25.0)    70.8 (20.8)    83.3 (25.0)
Other temp  5.3 (5.8)      10.8 (6.2)     11.1 (6.9)     16.8 (10.2)    18.9 (10.5)    19.3 (12.9)
Overall     73.1 (69.1)    76.9 (68.7)    78.3 (68.7)    79.1 (68.5)    81.3 (68.7)    81.9 (70.3)


Table 4: Inter-building Classification Accuracy for Rice Hall 


Type        5%             10%            20%            33%            50%            100%
CO2         63.5 (94.2)    96.9 (96.2)    90.8 (96.9)    93.6 (94.9)    92.3 (95.2)    98.1 (98.1)
Humidity    67.4 (28.8)    86.3 (47.4)    98.1 (45.0)    96.2 (41.0)    98.1 (25)      98.1 (44.2)
Room temp   78.0 (78.0)    78.2 (75.8)    72.9 (73.8)    77.9 (76.7)    80.3 (77.3)    53.2 (77.3)
Setpoint    77.4 (53.3)    83.3 (50.8)    86.5 (53.4)    87.9 (54.4)    87.2 (62.0)    91.8 (58.1)
Air volume  13.8 (34.7)    15.2 (33.1)    37.8 (25.1)    42.4 (32.9)    50.3 (30.3)    71.5 (38.8)
Other temp  48.3 (51.4)    49.7 (53.1)    58.4 (4.6)     45.0 (52.3)    45.9 (54.1)    67.6 (51.4)
Overall     68.2 (55.5)    74.1 (54.3)    78.4 (54.5)    80.3 (56.2)    81.2 (60.1)    83.0 (59.6)


Table 5: Inter-building Classification Accuracy for SDH 

Each table shows the averaged classification accuracy of experiments where different percentages of the complete set are used as
the training set (denoted as ‘X%’). In the percentage analysis, each percentage is repeated 1/percentage times and the averaged
accuracy is presented. LOO cross validation accuracy is also shown for the intra-building test case. On average, our solution
outperforms the baseline approach (shown in parentheses).


an averaged accuracy for that percentage. We can clearly
see a trend that more training instances yield better classification
results in all cases. However, we also notice that
after the training set includes about 20% of the complete set
(which is ~120 instances and ~260 instances for Rice and
SDH respectively) the accuracy does not increase much,
even when the training set reaches 100% of the complete set.
This indicates that we do not need many instances to bootstrap
the learning process within or across buildings to accomplish
sensor type classification tasks.
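The sampling procedure can be sketched with the standard library alone. The sketch assumes that each repetition is an independent random stratified draw (the text does not say whether repetitions are disjoint), keeps at least one instance per class in the training set (an added safeguard), and uses the Rice Hall class counts from Table 1 as labels.

```python
import random
from collections import defaultdict

def stratified_split(labels, fraction, rng):
    """Return (train_idx, test_idx) with about `fraction` of each class in train."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train = []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = max(1, round(fraction * len(indices)))  # at least one per class
        train.extend(indices[:n_train])
    train_set = set(train)
    test = [i for i in range(len(labels)) if i not in train_set]
    return sorted(train), test

# Labels mirroring the Rice Hall counts in Table 1.
counts = {"CO2": 16, "Humidity": 48, "Room temp": 142,
          "Setpoint": 265, "Air volume": 12, "Other temp": 119}
labels = [name for name, n in counts.items() for _ in range(n)]

rng = random.Random(0)
fraction = 0.20
repeats = round(1 / fraction)  # "1/percentage" repetitions, 5 for 20%
for _ in range(repeats):
    train_idx, test_idx = stratified_split(labels, fraction, rng)
    # A classifier would be trained on train_idx and scored on test_idx here,
    # and the accuracies averaged over the repetitions.
print(len(train_idx), len(test_idx))
```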


3.6 Window Size Sensitivity 

All the classification results thus far were obtained using
features extracted in 45-minute window slices on the original
sensor traces. We study how different window sizes affect
the classification performance. Figure 4 shows those results.
The intra case performs LOO cross validation while the inter
case runs 10-fold cross validation. For the intra-building
case, the classification is not sensitive to different window
sizes, as seen in the figure: accuracy stays almost
the same for both buildings because, within the same building,
as long as we can capture the short-term characteristics
of sensor dynamics in the windowed time slots, the size of the
time window does not make much difference. However,
for the inter-building case, the time window size matters:
because the local micro-climate in one building
can be quite different from another, we need to “tune”
this common short-term window to capture the dynamics
that can be used to learn type-related information across
buildings. Therefore, in order to achieve decent type classification
accuracy across buildings (i.e., use information from
one building to help classify the traces in another building),
we still need to optimize the size of the time window, which is
45 minutes in our case. This “tuning” is significant for the
learning process across buildings and is straightforward to
perform.

Figure 4: Classification accuracy of intra- and inter-building
cases against different sizes of time window: a window size of
45 minutes is optimal for our classification tasks.


3.7 Identifying Potential Misclassifications 

As discussed in Section 2.3, being able to quantify
the confidence in classifications and identify misclassified
instances in our sensor type classification is vital to improving
the overall accuracy, considering that in many cases where
our technique is used ground truth will be absent. As
an intermediate step to identify potentially misclassified
instances, we propose to quantify the “uncertainty” of classification
with the entropy-based approach described in Section
2.3. Figure 5 shows the CDF of the class probability entropy
of classifications in the intra- and inter-building scenarios.
We see that the collection of correct classifications (in solid
lines) has a distinct distribution from the collection of
misclassifications (in dotted lines). Based on this distinction in
the distributions, we can choose a certain entropy value as
a threshold and filter out all the classified instances whose
class probability entropy, as output by the forest, is greater
than the threshold. Figure 6 gives a summary of the performance
of our entropy-based approach to identifying potential
misclassifications. Here are some definitions needed
to understand the statistics:

$S_1$: the set of instances whose class probability entropy is
greater than the threshold.

$S_2$: the set of instances in $S_1$ that are misclassified in
the classification process.

$S_3$: the set of instances in $S_1$ that are correctly
classified in the classification process.

$S_4$: the set of instances that are misclassified in the
classification process.

$S_5$: the set of instances that are correctly classified in the
classification process.

The TPR, FPR and PPV are defined as:

$$TPR = \frac{|S_2|}{|S_4|}, \quad FPR = \frac{|S_3|}{|S_5|}, \quad PPV = \frac{|S_2|}{|S_1|},$$

where $|\cdot|$ denotes the cardinality of a set. We see that as
the threshold value increases, both the TPR (recall) and the FPR
(fall-out) decrease while the PPV (precision) keeps increasing.
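The thresholding and the three statistics can be sketched as
follows. This is an illustrative sketch under stated assumptions,
not the authors' code: the entropy here is the Shannon entropy
with a configurable logarithm base (the paper's exact definition
is the one in Section 3.3 and may be normalized differently), and
the function names are hypothetical. The counts `s1` to `s5`
follow the set definitions above.

```python
import math


def class_entropy(probs, base=2.0):
    """Shannon entropy of one class-probability vector (zero terms skipped)."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)


def flagging_stats(prob_rows, predicted, actual, threshold, base=2.0):
    """Return (TPR, FPR, PPV) for entropy-based flagging at one threshold."""
    # Membership in S1: entropy above the threshold.
    flagged = [class_entropy(row, base) > threshold for row in prob_rows]
    # Membership in S4: predicted label differs from the ground truth.
    wrong = [p != a for p, a in zip(predicted, actual)]

    s1 = sum(flagged)
    s2 = sum(f and w for f, w in zip(flagged, wrong))  # flagged and wrong
    s3 = s1 - s2                                       # flagged but correct
    s4 = sum(wrong)
    s5 = len(wrong) - s4

    def ratio(num, den):
        return num / den if den else float("nan")

    return ratio(s2, s4), ratio(s3, s5), ratio(s2, s1)
```

Evaluating `flagging_stats` over a range of thresholds gives the
points of a curve like the ones summarized in Figure 6; computing
it requires ground-truth labels, so it is an evaluation tool
rather than part of the deployed filter.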

In our case, a smaller threshold leads to a larger population of
instances being filtered out as potential misclassification
“candidates”, which helps identify more of the truly
misclassified instances. However, the more candidates we filter
out, the more instances we need to inspect manually, which
inevitably lowers the precision of the identification process.
We therefore want to strike a balance between achieving a high
recall and maintaining a high precision. Based on Figure 6, we
suggest a threshold value between 0.4 and 0.45. Note that there
are 50 and 13 misclassified instances for Rice and SDH,
respectively, in the intra-building test case. In the
intra-building case, such a threshold (0.4-0.45) helps identify
~30% of the misclassified instances for Rice and ~50% for SDH,
while ~70% and ~50% of the manually inspected instances are
actually correct classifications, for Rice and SDH respectively.
In the inter-building case, our approach is able to identify
~75% of the misclassified instances for both Rice and SDH, with
an overhead of ~40% and ~70% in the candidate inspection, for
Rice and SDH respectively.
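As a rough check on what these percentages mean in absolute terms
for the intra-building case, the reported figures can be combined
with the definitions of TPR and PPV. The values below are
approximations derived from the rounded percentages in the text
(with PPV taken as one minus the fraction of inspected instances
that were correct); they are not additional measurements.

$$\text{Rice: } |S_2| \approx TPR \cdot |S_4| \approx 0.30 \times 50 = 15, \qquad |S_1| \approx \frac{|S_2|}{PPV} \approx \frac{15}{0.30} = 50$$

$$\text{SDH: } |S_2| \approx 0.50 \times 13 = 6.5, \qquad |S_1| \approx \frac{6.5}{0.50} = 13$$

Under these approximations, an expert would inspect roughly 50
candidates at Rice to recover about 15 of the 50
misclassifications, and roughly 13 candidates at SDH to recover 6
or 7 of the 13.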

(a) Intra-building Classification
(b) Inter-building Classification

Figure 5: The CDF curves depict the class probability entropy
distribution for the collection of correct (‘-C’, solid lines)
and wrong (‘-W’, dotted lines) classifications in the intra- and
inter-building test cases. The collection of correct
classifications has a distinct distribution from the collection
of wrong ones.

(a) Intra-building Performance
(b) Inter-building Performance

Figure 6: The ROC curves depict the sensitivity of
misclassification identification to different entropy threshold
values. Choosing a threshold between 0.4 and 0.45 achieves the
best compromise between recall and precision.


4. DISCUSSION

There are several aspects of our work that we left out or did
not have time to explore more deeply. First, we go over the
expansion of type classes and how we could increase coverage of
sensor types in future work. We then discuss how we could
improve classification accuracy by drawing on data sources
outside the building data sets. We also discuss why we did not
explore principal component analysis in depth and how the
principal components can change from building to building.
Finally, we explain how our misclassification identifier could
be used to improve classification results.

4.1 Extension of Taxonomy and Class Scope

Our taxonomy covers five specific sensor types and one general
type. We could extend the class scope to include more sensor
types and make our technique more versatile. There are many
types of sensors in modern buildings, and the sensing fabric in
smart buildings continues to diversify, e.g., occupancy sensors,
light sensors, etc. We also want to build a deeper taxonomy for
certain types. For instance, there are set points for very
different actuators: temperature set points drive the HVAC
system, while air quality set points drive the filters and air
mixers. Being able to differentiate between these can help
enable general control applications in buildings.

4.2 Improvement on Classification Accuracy 

The learning and classification processes in our work rely only
on a set of general features. However, we wish to explore how
external or domain-specific knowledge could help improve
classification accuracy. For instance, if we know from a rain
forecast that the humidity in rooms will increase, we could
search for traces whose average reading values increase and use
this external knowledge to help identify humidity traces.

4.3 Feature Importance and Selection 

In our study, we did not delve into the importance of features
(i.e., principal component analysis) because the feature vector
contains only eight variables. Classification in a space of only
eight dimensions is not computationally expensive, even if some
of the vector elements carry redundant information. More
importantly, selecting the set of principal features for each
building results in a different feature set per building (as
demonstrated in Table 6), which makes classification across
buildings impossible. Still, evaluating the principal components
and uncovering overlap is important for obtaining optimal
classification performance for intra-building tasks and
single-type analysis.
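The exhaustive search described in the caption of Table 6 can be
sketched as below. This is only an illustration under stated
assumptions: to keep the example self-contained, a
1-nearest-neighbour classifier stands in for the single decision
tree used in the paper, and the function names are hypothetical.
With eight features the search covers 2^8 - 1 = 255 non-empty
subsets, which is why exhaustive evaluation is feasible here.

```python
from itertools import combinations


def loo_accuracy(rows, labels, columns):
    """Leave-one-out accuracy using only the given feature columns.

    Stand-in classifier: 1-nearest neighbour (the paper uses a decision tree).
    """
    correct = 0
    for i, row in enumerate(rows):
        best_dist, best_label = None, None
        for j, other in enumerate(rows):
            if i == j:
                continue  # hold out the instance being tested
            dist = sum((row[c] - other[c]) ** 2 for c in columns)
            if best_dist is None or dist < best_dist:
                best_dist, best_label = dist, labels[j]
        correct += best_label == labels[i]
    return correct / len(rows)


def best_feature_subset(rows, labels, n_features):
    """Evaluate every non-empty subset of columns; return (accuracy, columns)."""
    best = (-1.0, ())
    for size in range(1, n_features + 1):
        for columns in combinations(range(n_features), size):
            acc = loo_accuracy(rows, labels, columns)
            if acc > best[0]:  # ties keep the smaller, earlier subset
                best = (acc, columns)
    return best
```

Running `best_feature_subset` separately on each building's
instances is what can yield a different winning subset per
building, which is the obstacle to cross-building use noted
above.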


Building  Set of Best Features                     Acc. on All  Acc. on Best Set
Rice      min(MED), med(MED), med(VAR), var(VAR)   88.7%        91.5%
SDH       min(MED), max(MED), max(VAR), med(VAR)   97.1%        97.8%

Table 6: Classification accuracy on all the features and on the
best set of features in the intra-building test for each
building. The best feature sets are obtained by exhaustively
evaluating all feature combinations on a single decision tree
with leave-one-out cross validation. The best feature set is
different for each building.


4.4 Reducing Misclassification Iteratively

In cases where no ground truth labels are available, the
entropy-based approach can be used iteratively to improve
classification results. In each iteration, only a few examples
(those at the top of the entropy-based “uncertainty” ranking)
are inspected and corrected, and the corrected instances are
added to the training set. The training and classification
process is repeated until some criterion is satisfied. We expect
the number of examples needing manual inspection to be
dramatically reduced, both in each iteration and overall,
compared to a one-time inspection of the candidates filtered by
some threshold value. Such an interactive supervised learning
process can produce better classification results and reduce
the human labeling effort needed.
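The iterative procedure described in this section can be sketched
as a simple loop. This is a hedged sketch, not an implementation
from the paper: the callables `train`, `predict_proba` and
`oracle` are hypothetical placeholders for the learner, its
class-probability output, and the human expert, and the stopping
criterion (a fixed number of rounds or an empty pool) is an
assumption, since the text leaves the criterion open.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def iterative_relabel(train, predict_proba, labeled, unlabeled, oracle,
                      batch_size=5, max_rounds=10):
    """Repeatedly ask an expert about the most uncertain instances.

    train(labeled) -> model
    predict_proba(model, x) -> list of class probabilities
    oracle(x) -> label supplied by manual inspection
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    model = train(labeled)
    for _ in range(max_rounds):
        if not pool:
            break
        # Rank remaining instances from most to least uncertain.
        pool.sort(key=lambda x: entropy(predict_proba(model, x)), reverse=True)
        batch, pool = pool[:batch_size], pool[batch_size:]
        # Only the top of the ranking is inspected in each round.
        labeled.extend((x, oracle(x)) for x in batch)
        model = train(labeled)  # retrain with the corrected instances
    return model, labeled, pool
```

Keeping `batch_size` small is what limits the inspection effort
per round; the remaining `pool` is returned so that a caller can
apply a different stopping rule.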


5. RELATED WORK 

To the best of our knowledge, we are the first to approach the
problem of sensor type classification of physical data in
buildings. We describe the closest related work in different
problem domains, as well as work that uses the random forest as
a tool.

There has been much research on type classification in the
context of audio [14][18], music [10][16], video [3][13], web
query [11] and human activity [21][27]. The goal of [13] is to
classify audio into categories such as speech, music, background
sound and silence using support vector machines, and the work in
[?] addresses the same problem with an HMM-based statistical
model. Examples of music genre (e.g., jazz, pop and so forth)
classification are [10] and [?], which use GMM with the EM
algorithm and logistic regression respectively. Commonly used
features in this audio-related classification work are MFCC,
zero crossing rate, energy/power and spectral/temporal
statistics. For video type classification, texture and
color-based features are used to classify videos into classes
including cartoon, commercial, news and so on, with a decision
tree [3] and a neural network [?]. Query categorization has also
been researched: [13] exploits a rule-based classifier while [?]
uses a Markov random walk model. There is also work on human
activity classification in general cases [?] (e.g., running,
walking and sitting) and in home settings [23] (e.g., sleeping,
toileting and showering), using accelerometer data with a
voting-based classifier and HMMs with conditional random fields
respectively. In contrast, our work focuses on sensor type
classification using an ensemble learning technique.

Random forests have been applied in many different areas
[5][?][13][?]. [?] uses a gene as a feature to classify
microarray data. [?] uses the intensity of hundreds of measured
metabolites from medical subjects as features to classify the
subjects into groups of normal, diseased, and diseased with drug
treatment. Random forests have also been used in [13] to
classify objects in images with image-relevant features. In the
area of remote sensing, [?] utilizes user-defined parameters as
features to classify land cover types. In our work, we use a
simple and general statistical feature set for type
classification.

There is work leveraging percentile-based features in time
series data for different classification purposes. Tarzia et
al. [24] use a certain percentile in the audio spectrum to
classify the current room location. Wang et al. [28] utilize
percentile-based features in audio to characterize occupancy
and noise levels. By comparison, we use percentile statistics in
sensor time series as part of our feature set to differentiate
between different sensor types in commercial buildings.

6. CONCLUSION 

We describe a general, simple yet effective feature extraction
design in support of sensor type classification with time
series data. Experimenting with over 2000 streams from two
buildings on two campuses, our technique, which leverages an
ensemble learning method, achieves an accuracy of more than 92%
for testing within a building and more than 82% for testing
across buildings. We also discuss how to choose the window size
applied to a slice of the original time series and how the
number of training instances affects classification accuracy.
In general, around 100 instances are enough to bootstrap the
learning process in the case of 6 types of sensors. Another
important contribution of our paper is a probability-based
solution for identifying potentially misclassified instances.
Using the probabilities produced by the random forest, in both
the intra- and inter-building learning cases, we are able to
identify at least 30% of the misclassifications.

Our technique can act as a tool for metadata construction for
building sensors. Where type information of sensors is missing,
our technique can help infer and generate the type metadata.
Where metadata is available but inconsistent within or across
buildings, our solution can be used to verify type information
and unify the naming schema across platforms in different
buildings. Questions remain about how broadly we can expand our
taxonomy and how well our technique scales.